Abstract

This paper presents a framework that supports the
implementation of parallel solutions for the parametric maximum
flow computational models widely used in image segmentation
algorithms. The framework is based on supergraphs, a
special construction combining several image graphs into a
larger one, and works on various architectures (multi-core
or GPU), either locally or remotely in a cluster of computing
nodes. The framework can also be used for performance
evaluation of parallel implementations of maximum flow
algorithms. We present the case study of a state-of-the-art
image segmentation algorithm based on graph cuts,
Constrained Parametric Min-Cut (CPMC), that uses the parallel
framework to solve parametric maximum flow problems,
based on a GPU implementation of the well-known
push-relabel algorithm. Our results indicate that real-time
implementations based on the proposed techniques are possible.

1 Introduction 

Recent advances in image segmentation have led to
improved accuracy over large and diverse image datasets,
almost doubling the performance figures. This
development has spurred interest in the widespread use
of image segmentation models (figure 2) as a component for
key tasks in computer vision, such as video segmentation,
large-scale applications for recognition and classification, or
mobile computing. In this context, the real-time performance
of image segmentation algorithms becomes particularly
important. Although reliant on advanced methodology and
data structures, the best performing algorithms still lag
behind real-time, taking a few minutes on average for
typical images.

The most advanced image segmentation algorithms involve
repeatedly solving multiple maximum-flow problems
over monotonic schedules of parameter scales (parametric
max-flow) constrained at image "seeds" corresponding to
different locations in an image. Each image can be
represented as a graph, where each pixel is a node connected
locally to spatially adjacent ones (e.g. up, down, left and
right), and connection strengths are modulated by pixel
intensity similarity or the presence of image contours.
Solving each max-flow problem for one setting of the parameters
is equivalent to computing a binary partition on the image
graph. It has been empirically observed that this process,
performed systematically at different locations and for
monotonic schedules of parameters, generates multiple binary
segmentation hypotheses with good spatial overlap with the
different objects and scene structures present in images (see
figure 1). Often, the hypothesis generation is initiated from
different seeds independently, suggesting an inherently high
degree of parallelism. Therefore, a trivially parallel
implementation that generates solutions by running parametric
maximum flow independently for each seed seems
appropriate.


Figure 1. (best seen in color) Original image, pool
of segments generated by a max-flow segmentation 
algorithm, and ground truth respectively. 

However, the high computational cost of generating segment
hypotheses once a location (seed) has been selected
suggests parallelizing the parametric maximum flow
procedure as an alternative way to speed up image segmentation.
Currently, as far as we know, there are no available parallel
implementations of a parametric maximum flow algorithm.
In this paper, we present the design of a general framework
that can use existing parallel graph cut solutions such as
GridCut [1] or CUDA NPPI [2, 3] to implement a parallel
algorithm that approximates parametric maximum flow
behavior. To this end, we use supergraphs, a special
construction that combines several image graphs, each having edge
weights (or capacities) that depend on a different parameter,
into a larger one.

The framework is general in terms of the architectures
it can use. Supergraphs can run on multi-core processors
or GPU boards, either locally or distributed in a cluster. A
parametric maximum flow problem encoded as a collection
of supergraph cut problems can be scheduled dynamically
to run on a heterogeneous collection of computing nodes.
The dynamic scheduler efficiently adapts not only to the
imbalances induced by the heterogeneous architectures used,
but also to those intrinsic to the problem, as each problem
takes a different amount of time, depending on the image
complexity.

The paper also presents a case study of a state-of-the-art
image segmentation algorithm, Constrained Parametric
Min-Cut (CPMC), that uses the parallel framework
with NVIDIA's GPU implementation of the well-known
push-relabel maximum flow algorithm. The comparison
to a CPMC solution based on a sequential pseudoflow
algorithm in a trivially parallel setup (where instances
of a segmentation problem are executed independently on
several cores of a processor) helps us understand how close
we can get to a real-time solution for image segmentation.

To summarize, the paper has three main contributions:
(1) a parallel solution for parametric maximum flow
problems based on supergraphs that can be used in image
segmentation algorithms; (2) a general, parallel framework
for parametric maximum-flow problems that can
(a) handle various hardware architectures, multi-core or
GPU, both locally and remotely in a cluster, (b) efficiently
schedule problems to achieve improved segmentation times,
and (c) act as a performance evaluation tool by allowing the
use of various implementations of parallel maximum flow
algorithms; and (3) a case study for a state-of-the-art
image segmentation algorithm (CPMC).

The paper is organized as follows: §2 briefly describes
previous work on graph cuts and image segmentation for a
clear understanding of the concepts. In §3 we present the
framework solutions to parallelize the parametric maximum
flow algorithms using supergraphs. §4 presents the results
of our case study evaluation of CPMC. We conclude in the
final section.

2 Graph-Cut based Image Segmentation 

Graph cuts can be used to segment an image into a
foreground object and the rest of the image, usually referred to
as background, in order to obtain a figure-ground
segmentation. This is a form of binary classification, with 1 assigned
to foreground pixels and 0 to background.

The binary inference (labeling) process is performed
by running a maximum flow/minimum cut algorithm on a
graph whose vertices represent the pixels in the image. Two
special vertices, the source s and the sink t, are connected to
every vertex of the graph by means of weighted edges (see
figure 2); the weights are called edge capacities. For image
segmentation, the source and sink are associated with the
two labels that will be used to distinguish the foreground
object from the background. The weights of the edges that link
s and t to the graph vertices (the pixels) quantify a penalty
expressing how correct it is to assign that pixel to either of the
two classes of labels represented by the source and the sink.

Figure 2. Associated image graph and a cut example
(best seen in color). Panels: graph G; a cut in graph G.

Regular graph vertices (corresponding to image pixels)
are linked to each other by weighted edges as well.
Typically, image segmentation models use the weights of the
edges that connect each vertex to its nearby neighbors (up,
down and laterally) to model smoothness, i.e. the
assumption that nearby pixels are likely to have similar labels.
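
As an illustration of the pairwise edges described above, the following minimal sketch builds the 4-connected neighborhood of a small grayscale image. The Gaussian weighting of intensity differences and the parameter `sigma` are assumptions chosen for illustration; the text only states that the weights model smoothness and similarity.

```python
import math

def grid_edges(image, sigma=10.0):
    """Return {(u, v): weight} for the 4-connected pixel grid.

    `image` is a list of rows of intensities; pixel (r, c) has index
    r * width + c. The Gaussian similarity weight is an illustrative
    assumption, not the weighting used by any particular model.
    """
    height, width = len(image), len(image[0])
    edges = {}
    for r in range(height):
        for c in range(width):
            u = r * width + c
            # Right and down neighbors cover every undirected edge once.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < height and cc < width:
                    v = rr * width + cc
                    diff = image[r][c] - image[rr][cc]
                    edges[(u, v)] = math.exp(-(diff * diff) / (2 * sigma * sigma))
    return edges

example_image = [[10, 10, 200],
                 [10, 10, 200]]
example_edges = grid_edges(example_image)
```

Pixels with equal intensity get weight 1.0, so cutting between them is expensive, while the edge across the strong intensity step gets a weight close to zero.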

An s-t cut of the graph is a partitioning of the vertices 
into two disjoint subsets: one containing vertex s and the 
other one containing vertex t. The cost of the cut is defined 
to be the sum of the weights of those edges in the graph 
that have one vertex in the s-partition and the other in the 
t-partition. A minimum cut is a cut whose cost is minimum
among all s-t cuts of the graph.

A graph cut induces a labeling of the image pixels,
depending on which partition they were assigned to. The
problem of finding a minimum cut is equivalent to that of
minimizing an energy defined on the graph. The energy has two terms,
depending on which type of edges the cut crosses: edges
linking either s or t to a regular vertex (pixel), or regular
edges that link neighboring pixels. The first category of
terms is called "data" or "unary" terms, while the second
accounts for the "pairwise" terms (regularization terms). A
minimum cut in such an image graph corresponds to a
minimum energy among all of the possible label configurations
of the image graph.

Greig et al. were the first to use this method
to smooth noisy images and showed that the maximum a
posteriori estimate of a binary image corresponds exactly
to the maximum flow in the associated image graph
constructed as previously described. According to the Ford and
Fulkerson theorem, a maximum flow from s to t saturates
a set of edges in the graph that partitions the vertices into
two disjoint sets, and the sum of the capacities of these edges
equals the cost of a minimum cut in the graph.
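
The equality between maximum flow and minimum cut can be checked on a toy graph. The sketch below is not one of the solvers discussed in this paper: it uses a plain breadth-first augmenting path search (Edmonds-Karp) and compares its result against a brute-force enumeration of all s-t cuts. The capacities are made up for illustration.

```python
from collections import deque
from itertools import product

def max_flow(cap, s, t):
    """Edmonds-Karp on a dense capacity matrix; returns the flow value."""
    n = len(cap)
    residual = [row[:] for row in cap]
    flow = 0
    while True:
        parent = [-1] * n
        parent[s] = s
        queue = deque([s])
        while queue and parent[t] == -1:
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and residual[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[t] == -1:
            return flow  # no augmenting path left
        # Find the bottleneck along the path, then push flow through it.
        bottleneck, v = float("inf"), t
        while v != s:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = t
        while v != s:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck

def min_cut_brute_force(cap, s, t):
    """Enumerate every s-t partition and return the cheapest cut cost."""
    n = len(cap)
    others = [v for v in range(n) if v not in (s, t)]
    best = None
    for bits in product((0, 1), repeat=len(others)):
        side = {s: 1, t: 0}
        side.update(zip(others, bits))
        cost = sum(cap[u][v] for u in range(n) for v in range(n)
                   if side[u] == 1 and side[v] == 0)
        best = cost if best is None else min(best, cost)
    return best

# Vertex 0 is the source s, vertices 1 and 2 are pixels, vertex 3 is the sink t.
toy_cap = [[0, 5, 2, 0],
           [0, 0, 3, 1],
           [0, 3, 0, 6],
           [0, 0, 0, 0]]
toy_flow = max_flow(toy_cap, 0, 3)
toy_cut = min_cut_brute_force(toy_cap, 0, 3)
```

Both values coincide on this graph; the brute-force check is exponential in the number of pixels and is only meant as a sanity test.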

There are many polynomial time algorithms that solve
the maximum flow problem, including augmenting
path (Ford-Fulkerson) and push-relabel
algorithms, but their presentation is beyond the scope of this
paper. An augmenting path algorithm widely used in
computer vision is due to Boykov and Kolmogorov. An
extended view on the use of graph cuts in computer vision can
be found in the literature.



Figure 3. (best seen in color) A regular grid of seeds
in an image. Binary partitions (segmentations) are
extracted around regularly placed foreground seeds
(green dots) that express the foreground bias, while
background seeds (in blue) are usually placed on the
border of the image. Each generated segment
corresponds to a graph cut segmentation problem. For an
entire image, multiple independent solutions are
generated at the cost of heavy graph cut computations.

2.1 Parametric Max Flow & Image Segmentation 

Parametric max flow algorithms are used in image
segmentation to generate a set of hypotheses for plausible
object segments in a given image. They are able to
optimize energies where the unknowns are both the binary
labels of pixels and the weighting (scale) λ between the unary
and pairwise terms of the energy model. The λ values for
which the corresponding energy value changes are called
"breakpoints" and mark the optimal solutions of a parametric
max flow problem. In the "monotonic" case, where the
factors multiplying the parameter λ in the unary (data)
energy terms are all non-negative or all non-positive, the
optimal solutions are nested and an efficient implementation
of the parametric maximum flow algorithm is possible.
The algorithms can either compute all breakpoints (an
upper bound is the number of graph nodes) or a subset of
them. Either way, monotonicity makes the calculation
significantly more efficient as earlier computations are reused.
In practice, a preset list of parameter values (usually
defined on a logarithmic scale), the so-called λ-schedule, can
be used instead of computing all the breakpoints, as
empirical evaluations have shown that the ground truth
covering stays almost the same, at significantly lower
computational cost due to the reduced number of breakpoints
generated (and thus, a reduced number of segment
hypotheses). We say that this type of run "approximates" parametric
max flow behavior.
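
A preset λ-schedule on a logarithmic scale can be sketched as follows. The bounds 0.01 and 100 are hypothetical values chosen for illustration; the counts of 20 λ values and 178 seeds are the defaults reported in the evaluation later in the paper.

```python
import math

def lambda_schedule(lowest, highest, count):
    """Return `count` values spaced evenly on a logarithmic scale."""
    step = (math.log(highest) - math.log(lowest)) / (count - 1)
    return [math.exp(math.log(lowest) + i * step) for i in range(count)]

# Hypothetical bounds; only the number of values comes from the paper.
schedule = lambda_schedule(0.01, 100.0, 20)

# One max flow problem per (seed, lambda) pair in the approximate run.
seed_ids = range(178)
problems = [(seed, lam) for seed in seed_ids for lam in schedule]
print(len(problems), "max flow problems per image")
```

Under these assumptions an approximate run solves one maximum flow problem for every pair of seed and λ value, which is the workload the rest of the paper sets out to parallelize.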

Graph cut problems (preferably monotonic) are associated
with different seeds in order to generate a pool of
segments with high probability of (foreground) object overlap.
A seed is a set of pixels "frozen", by construction, to belong
to either foreground or background. The foreground seeds
are usually placed regularly on a grid in the image, whereas
the background seeds are assigned on the borders of the
image (see figure 3). A collection of maximum flow problems
is solved for each pair of foreground and background seeds
and different λ values (the λ-schedule) that are used to
express the so-called foreground bias associated with the
non-seed pixels. The result is a large and diverse set of segments
of different sizes and structural (shape) relevance.

2.2 Trivial Parallelism on Multi-Core Processors 


A list of problems defined by a pair of foreground and
background seeds and a λ-schedule can be solved
independently on different processing units, given that no two pairs
of foreground and background seeds are the same. For
instance, a trivially parallel solution can be implemented by
using MATLAB's parfor instruction (or similar instructions
in other languages) that executes each iteration
independently as a thread on one of the available processor cores.
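
The same trivially parallel pattern can be sketched outside MATLAB. The example below uses Python's standard `concurrent.futures` module in place of parfor; `solve_seed_problem` is a hypothetical placeholder that stands in for a sequential parametric max flow solver and only returns labels describing the work it would do.

```python
from concurrent.futures import ThreadPoolExecutor

def solve_seed_problem(problem):
    """Hypothetical stand-in for a sequential parametric max flow call."""
    seed_id, schedule = problem
    # A real solver would return one segment hypothesis per lambda value.
    return seed_id, [f"cut(seed={seed_id}, lambda={lam})" for lam in schedule]

def solve_all(problems, workers=4):
    """Run every seed problem independently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(solve_seed_problem, problems))

schedule = [0.1, 1.0, 10.0]
problems = [(seed_id, schedule) for seed_id in range(8)]
results = solve_all(problems)
```

In CPython, threads only give a real speed-up when the solver releases the interpreter lock, as native extensions typically can; a process pool would be the alternative for pure Python solvers.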

The main advantage of this type of parallelization is
simplicity, both in terms of programming effort and the
negligible need for synchronization of the worker threads (thus
enabling maximum parallelism). From a programming
standpoint, one only needs to mark the appropriate parfor code
blocks. The programming model is sequential for any
particular parfor code block solving a problem, and the
speed-up comes from the high usage of the available cores of the
processor.

Figure 4. Trivial parallelism performance for figure-ground
segmentations. Plot title: trivially parallel parametric max
flow on multi-core processors; x-axis: number of threads;
series: 2 x 2.2 GHz 8-core CPU and 2.4 GHz 6-core CPU.

However, this parallel solution does not attempt to speed
up any of the individual problems, which run sequentially.
Figure 4 shows the mean time in seconds taken by the
pseudoflow algorithm to yield figure-ground hypotheses
for a set of 500 images from the VOC dataset, with
a schedule of 20 λ values and 178 seeds each. Note
that, in practice, even with a relatively small number of
processing units (cores), the speed-up of the trivially parallel
solution flattens out quite quickly. As soon as 10 cores are
used, the performance of the slower processor (Intel Xeon
E5-2660) starts to saturate above 10 seconds, still far from
real-time expectations. Even on the faster processor (Intel
Xeon E5-2620 v3) the execution times get close to the
other processor's times as soon as 6 cores are used. This lack
of scalability motivates the need to investigate parallel
solutions for the parametric maximum flow solver as well.

3 Parallel Parametric Max Flow Solution 

To the best of our knowledge, there is no available
parallel implementation of a parametric maximum flow
algorithm. A few prior articles focus on the topic of parallel
implementations of maximum flow, but do not
offer code. One available implementation is GridCut [1],
which works for multi-core processors only. It defines a grid
of computing units that can process graph cuts in parallel
based on a popular augmenting path algorithm featuring
tree-reuse. GridCut implements adaptive bottom-up
merging and a cache-efficient memory layout. Other
available implementations of max flow algorithms are GPU
implementations. The NVIDIA NPPI library
implements a push-relabel algorithm.

Given the circumstances, running a parallel parametric
maximum flow algorithm proves challenging. One solution
would be to seek an "approximation" of the parametric
behavior (in the sense defined in §2.1) by using a preset
λ-schedule, and run a parallel maximum flow routine once for
each λ value in the schedule. However, this "batch" call
is far from optimal, since it is proven that, in the monotonic
case, a parametric maximum flow algorithm can run
asymptotically close to a regular maximum flow algorithm,
i.e. with the same theoretical complexity. Hence, this batch
procedure is a poor match to what an optimal parallel
parametric maximum flow algorithm could achieve in theory.

3.1 Parametric Max Flow with Supergraphs 

Besides the inability to optimize computations in the
monotonic case, another shortcoming of the batch method
is that running a single graph cut problem at a time, either
on a multi-core processor or a GPU architecture, might not
use the available hardware resources to the fullest. This
becomes especially striking for the latest generation of GPU
boards that feature thousands of computing cores.

To address the issue, consider running several graph cut
problems simultaneously on the available parallel computing
infrastructure. This is not straightforward, since the
programming interfaces of parallel graph cut routines
supplied by software like GridCut or CUDA NPPI take a single
graph as parameter. The solution is to "knit" together
several graphs representing different problems into a larger
graph, which we call a "supergraph", and to pass it on as a
parameter to the graph cut calls.

These supergraphs represent the building block of our
parallel framework for parametric max flow problems in
image segmentation and can be constructed at two levels:
λ and seed level. A λ-supergraph combines together
graphs for several λ values, whereas a seed supergraph
combines several λ-supergraphs together. Usually, our
structures combine an entire λ-schedule, the list of the λs that
we run the parametric max flow with, but it is possible to
have smaller supergraphs as well. In the case of seed
supergraphs though, we use only λ-supergraphs constructed for
an entire λ-schedule.

Combining two graphs into a supergraph can be done
simply by inserting additional vertices "between" the two
graphs and by linking them to the regular vertices from the
left and right graph by means of zero-weight edges (see
figure 5). Inductively, one can build arbitrarily large
supergraphs out of individual graphs. Any minimum cut in a
supergraph built like that is a union of the disjoint minimum
cuts of the original graphs knitted together, plus some
zero-weight edges that do not count towards the overall cost of
the supergraph cut. Therefore, the minimum energy
associated with a pair of foreground-background seeds
and a given λ value can be derived from such a supergraph
by decomposing the minimum supergraph cut into its
individual minimum cut components. In other words,
computing a supergraph maximum flow/minimum cut
approximates the behavior of a parallel parametric maximum flow
algorithm running on the individual graphs.
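
The knitting step and the decomposition property can be sketched on tiny graphs. The representation below (lists of source and sink capacities plus a dictionary of undirected pixel edges), the single connector vertex, and the brute-force minimum cut are assumptions made for illustration; they are not the data layout expected by GridCut or NPPI.

```python
from itertools import product

def cut_cost(graph, labels):
    """Cost of the cut induced by labels (1 = source side, 0 = sink side)."""
    source_caps, sink_caps, edges = graph
    cost = 0
    for i, label in enumerate(labels):
        # A source-side pixel cuts its sink edge, and vice versa.
        cost += sink_caps[i] if label == 1 else source_caps[i]
    for (u, v), weight in edges.items():
        if labels[u] != labels[v]:
            cost += weight
    return cost

def min_cut(graph):
    """Brute-force minimum cut; only usable on very small graphs."""
    n = len(graph[0])
    return min((cut_cost(graph, labels), labels)
               for labels in product((0, 1), repeat=n))

def knit(g1, g2):
    """Join two graphs through one connector vertex with zero-weight edges."""
    (src1, snk1, e1), (src2, snk2, e2) = g1, g2
    n1, n2 = len(src1), len(src2)
    connector = n1 + n2
    edges = dict(e1)
    edges.update({(u + n1, v + n1): w for (u, v), w in e2.items()})
    edges[(n1 - 1, connector)] = 0   # link to the last vertex of g1
    edges[(connector, n1)] = 0       # link to the first vertex of g2
    return (src1 + src2 + [0], snk1 + snk2 + [0], edges)

g1 = ([5, 2, 1], [1, 6, 4], {(0, 1): 3, (1, 2): 2})
g2 = ([3, 3], [1, 5], {(0, 1): 2})
super_cost, super_labels = min_cut(knit(g1, g2))
labels1, labels2 = super_labels[:3], super_labels[3:5]
```

Slicing the supergraph labeling back into its components recovers a minimum cut for each original graph, and the supergraph cost is the sum of the individual minimum cut costs, since the connector edges contribute nothing.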

Figure 5. Supergraph composed out of k individual
graphs G_1, G_2, ..., G_k (best seen in color).

To see why this works, consider the following situation.
Let S be a supergraph composed of k individual graphs G_1,
G_2, ..., G_k (see figure 5). Let's assume C is a minimum cut
in S, and C_1, C_2, ..., C_k are the cuts that C induces in the
individual graph components. Suppose that one of the individual
cuts, say C_i, is not a minimum cut in G_i. Then there is a
minimum cut C_i' in G_i, different from C_i. Since G_i is linked
by means of zero-weight edges to its neighboring graphs in
S, any minimum supergraph cut that crosses G_i needs to
include C_i', because the zero-weight crossing edges do not
contribute to the overall cost, C_i' is the minimum cost path
severing G_i into two partitions, and the supergraph cut must
somehow cross G_i. Therefore, there is another supergraph
cut including C_i' of smaller cost than C, which contradicts
the assumption that C is a minimum supergraph cut. Thus,
all of the C_i must be minimum cuts in their corresponding
G_i.

Conversely, suppose that there is a supergraph cut C' in
S whose cost is smaller than that of C. Then there is
another union of individual cuts C_1' U C_2' U ... U C_k'
that composes C'. Since the C_i' are disjoint sets, it means
that at least one of them, say C_i', has a smaller cost than
its corresponding C_i, which contradicts the assumption that
C_i is a minimum cut in G_i.

Therefore, it is sound to use this procedure to amass
several graphs into a supergraph and use the individual
components of the minimum supergraph cut as individual
minimum cuts. Thus, one can solve simultaneously either a
single seed problem (by building a λ-supergraph) or several
seed problems (by binding together several λ-supergraphs
for several seeds). The ability to run custom-sized graphs
results in a better usage of the available computing power
of the underlying hardware architecture.

We have used supergraphs both with GridCut and CUDA
NPPI, but our CPMC case study focuses on the GPU
solution, as our evaluations showed that GridCut performs
significantly worse than CUDA. Nevertheless, we emphasize
that the supergraph method is general and can be used with
any available parallel graph cut implementation as a means
to compute a parametric maximum flow in parallel. The
method is especially effective when the parallel graph cut
source code is not available (e.g. CUDA NPPI).

3.2 Exposing Additional Supergraph Parallelism 


It is known that exchanging the roles of source and sink,
an operation that we call an s-t swap, does not affect the results
of a graph cut algorithm (i.e., the maximum flow/minimum
cut remain the same). However, it might help a parallel
implementation of a push-relabel algorithm, like NVIDIA's
nppiGraphcut, run faster. The reason for this
behavior is that the parallel workload at every iteration of the
algorithm is given by the number of regular vertices (pixel
nodes) that have residual capacity on their edges to/from
the source, independent of the edges to/from the sink. So, if
there are more such vertices for the sink than for the source,
swapping them exposes more parallelism.

Choosing the source and the sink by running the
algorithm twice, once with the original graph and then after
an s-t swap, to see which run yields faster results, is
obviously not a solution. Instead, we use a heuristic to choose
the source and sink. For each regular vertex in the graph,
the difference between the capacities of its source and sink
edges is computed. Then, over all vertices, the positive and
the negative differences are accumulated separately. If the
number of negative differences turns out to be larger than that
of positive ones, we apply the s-t swap. This procedure leads to
more hardware resources active per algorithm iteration and
improves performance significantly (see §4.3).
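
A minimal sketch of this heuristic follows. It takes the decision rule at face value and counts the vertices with negative and positive differences; whether the original implementation counts vertices or sums the magnitudes is not stated precisely, so the counting rule is an assumption, as are the function names.

```python
def needs_st_swap(source_caps, sink_caps):
    """Return True when more pixels lean towards the sink than the source."""
    diffs = [s - t for s, t in zip(source_caps, sink_caps)]
    positive = sum(1 for d in diffs if d > 0)
    negative = sum(1 for d in diffs if d < 0)
    return negative > positive

def align_for_parallelism(source_caps, sink_caps):
    """Swap source and sink capacities when the heuristic asks for it.

    Returns (source_caps, sink_caps, swapped). When `swapped` is True the
    labels produced by the solver must be inverted afterwards, because the
    roles of foreground and background were exchanged.
    """
    if needs_st_swap(source_caps, sink_caps):
        return list(sink_caps), list(source_caps), True
    return list(source_caps), list(sink_caps), False

# Hypothetical capacities for three pixels.
src, snk, swapped = align_for_parallelism([1, 1, 5], [4, 3, 2])
```

The pairwise pixel edges are untouched by the swap, so only the two terminal capacity arrays need to be exchanged before the graph cut call.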


When using s-t swaps with supergraphs, one has to
properly choose the source and sink so that every individual
graph is aligned for maximum available parallelism. Thus,
each individual graph composing a supergraph must be
checked to see whether it needs to be s-t swapped, so that the
resulting supergraph has a source and a sink that allow the
highest possible degree of parallelism. This is easier for
λ-supergraphs, because such a supergraph represents the
same problem (i.e., the same foreground-background pair
of seeds), but care must be taken when building seed
supergraphs, which might need to reverse some of the individual
λ-supergraphs.

3.3 Using the Parallel Framework 

A collection of seed problems, each represented by a
pair of foreground and background seeds and a λ-schedule,
is encoded by means of supergraphs, as
previously described. The resulting set of supergraphs can
have a smaller size than the collection of seed problems
if several such problems are expressed by means of seed
supergraphs. Each resulting supergraph gets scheduled for
parametric maximum flow processing on a given computing
node (also called server), either locally or remotely.
Remote processing is achieved by means of Remote Procedure
Calls (RPC) for the supergraph cut routines. The scheduling
is controlled by a master node, which runs the image
segmentation algorithm. The master node can act as a computing
server as well, but in this case the local computing
architecture, either CPU- or GPU-based, is accessed
directly instead of performing an RPC. The resulting
cluster of servers collaborating to solve the collection of seed
problems may be heterogeneous, regardless of whether
operation is on CPUs or GPUs.

3.3.1 Supergraph Scheduling 

The master node can perform two types of parallel,
non-preemptive scheduling: static and dynamic. Static
scheduling assigns supergraphs to computing servers by using a
MATLAB parfor instruction with n threads in which each
parallel loop iteration i gets allocated task i mod n. All the
tasks allocated to a class i mod n, e.g., those that execute
an RPC to a given remote server, will be executed
sequentially and non-preemptively, one after the other. Thus, the
makespan, the maximum value among the completion times
of the tasks, will be determined by the time needed to run
the longest class i of tasks i mod n.
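
The makespan of the static policy follows directly from this description. The sketch below assumes the processing time of every task is known, which is only true after the fact; the task times are made up for illustration.

```python
def static_makespan(task_times, n):
    """Assign task i to class i mod n and return (makespan, class totals)."""
    class_totals = [0] * n
    for i, duration in enumerate(task_times):
        # Tasks of one class run back to back on the same server.
        class_totals[i % n] += duration
    return max(class_totals), class_totals

# Five hypothetical supergraph cut times (seconds) on two servers.
makespan, totals = static_makespan([4, 2, 7, 1, 3], 2)
print(makespan, totals)
```

With these times the first class accumulates 14 seconds while the second only needs 3, which illustrates how a fixed assignment can leave one server idle for most of the run.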

However, in a heterogeneous computing environment
with different hardware architectures (for instance,
different types of GPU boards, as in our case study), significant
computational load imbalances may arise. Even the same
hardware, say two GPU boards of the same kind, will not
yield the same performance when accessed locally vs. over
the network via RPC. Moreover, there is an intrinsic source
of imbalance in the segmentation problem, because
different seeds of an image induce different image graphs and
therefore different graph cut computational costs.

The dynamic scheduler, which is also multi-threaded,
attempts to offset these load imbalances by picking up a
server from a list of available ones in FIFO order and
executing the RPC (or a local call, in the case of the master
node) to that node with a supergraph as a parameter. The
server is removed from the list and, later on, when the
supergraph cut call finishes, is inserted back into the list. Hence,
the list of available servers grows and shrinks dynamically
and, at times, may become empty, in which case no server is
available for computation and the master program
dispatching the tasks gets blocked.
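
The dynamic policy can be sketched with a blocking FIFO queue of available servers. This is only an illustration of the mechanism described above: `cut_call` is a hypothetical stand-in for the RPC or local supergraph cut routine, and the server names are invented.

```python
import queue
import threading

def dynamic_schedule(supergraphs, servers, cut_call):
    """Dispatch each supergraph to the first server that becomes available."""
    available = queue.Queue()          # FIFO list of idle servers
    for server in servers:
        available.put(server)
    results = [None] * len(supergraphs)
    workers = []

    def run(index, server, graph):
        try:
            results[index] = (server, cut_call(server, graph))
        finally:
            available.put(server)      # server rejoins the list when done

    for index, graph in enumerate(supergraphs):
        server = available.get()       # blocks while no server is idle
        worker = threading.Thread(target=run, args=(index, server, graph))
        worker.start()
        workers.append(worker)
    for worker in workers:
        worker.join()
    return results

def fake_cut_call(server, graph):
    """Hypothetical placeholder for a supergraph cut on `server`."""
    return graph * graph

outcome = dynamic_schedule(list(range(6)), ["gpu0", "gpu1"], fake_cut_call)
```

The dispatching loop blocks on the empty queue exactly as the master program does in the description, so slow servers simply receive fewer supergraphs.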

In contrast to static scheduling, this dynamic policy, which
handles a supergraph as soon as a server is available,
offers a more balanced mix of task overlaps, which in turn
should contribute to a smaller makespan. Optimal
scheduling of independent, non-preemptable tasks to minimize the
makespan is known to be NP-hard. However, a priori
knowledge about the supergraph cut processing times may
improve the worst-case performance of the scheduling
algorithm. For instance, sorting the task list in non-increasing
order of processing times before scheduling and
assigning the first available task to the first available server
during scheduling, a policy known as Largest Processing Time
First (LPT), is an effective way to minimize the worst-case
makespan. In our case, however, this a priori information is
hard to get, because it is highly data dependent.
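
For reference, LPT can be simulated in a few lines when processing times are assumed to be known and servers are assumed identical. Neither assumption holds in our setting, so this is only a sketch of the policy itself, with made-up task times.

```python
import heapq

def lpt_makespan(task_times, servers):
    """Simulate Largest Processing Time First on identical servers."""
    # Each heap entry is (time at which the server becomes free, server id).
    free_at = [(0, server) for server in range(servers)]
    heapq.heapify(free_at)
    for duration in sorted(task_times, reverse=True):
        start, server = heapq.heappop(free_at)   # first available server
        heapq.heappush(free_at, (start + duration, server))
    return max(finish for finish, _ in free_at)

# Same five hypothetical task times used for illustration, two servers.
lpt_result = lpt_makespan([4, 2, 7, 1, 3], 2)
print(lpt_result)
```

On this toy input the longest task is started first and the remaining ones fill in around it, giving a makespan of 9 time units.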

The scalability of the framework depends on two main
components: the scalability of the master scheduler and that
of the parallel graph cut routine processing the supergraphs.
For the latter, our framework is limited by the scalability
of the available software used (i.e., GridCut, CUDA NPPI,
etc.). The scalability of the scheduler is influenced by the
available resources on the master node and the network
latency for remote communication. However, as our
experiments show (see §4.6 and §4.8), achieving near real-time
performance does not necessarily require many computing
nodes. Therefore, one can conclude that the master
scheduler should not face scalability problems as long as it can use
small-scale multiprocessor (6-8 cores) machines.

3.3.2 Supergraphs and Network Communication 

Typical sizes of the images in the VOC dataset used
in our case study amount to roughly 80-100K pixels, and
so are the sizes of the corresponding image graphs.
Usually, for computational reasons, image segmentation
algorithms like CPMC downsample images to half, so the
resulting graph size sums up to approximately 160-200KB
of memory (for 4-byte floats or integers). Packing several
such graphs into a supergraph can enlarge the size of the
RPC parameters even further. Moreover, library calls like
NVIDIA's nppiGraphcut [2, 3] require five such large
matrices as parameters, among others. As a result, the overall
size of the RPC parameters that have to be transferred over
the network tends to be quite large and thus may have a
negative impact on the performance of the call.
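
The sizes quoted above can be reproduced with a back-of-the-envelope calculation. The sketch assumes that "half" refers to half the pixel count, that a supergraph grows linearly with the number of graphs it packs, and that a Gigabit Ethernet link delivers its ideal 125 MB/s; all three are simplifying assumptions.

```python
BYTES_PER_VALUE = 4                 # 4-byte floats or integers
MATRICES_PER_CALL = 5               # five large matrices per graph cut call
LAMBDA_COUNT = 20                   # default schedule length in the case study
GIGABIT_BYTES_PER_S = 125_000_000   # ideal link rate, an optimistic assumption

def matrix_bytes(pixels):
    """Size of one per-pixel matrix after downsampling to half the pixels."""
    return (pixels // 2) * BYTES_PER_VALUE

def rpc_payload_bytes(pixels, graphs=1):
    """Approximate payload of one call for a supergraph of `graphs` graphs."""
    return matrix_bytes(pixels) * MATRICES_PER_CALL * graphs

for pixels in (80_000, 100_000):
    payload = rpc_payload_bytes(pixels, graphs=LAMBDA_COUNT)
    seconds = payload / GIGABIT_BYTES_PER_S
    print(f"{pixels} pixels: {matrix_bytes(pixels) / 1000:.0f} KB per matrix, "
          f"{payload / 1e6:.0f} MB per lambda-supergraph, "
          f"about {seconds:.2f} s on an ideal link")
```

Under these assumptions a single λ-supergraph for a full schedule already amounts to tens of megabytes, which is why the transfer cost cannot be ignored.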

One possible solution to alleviate the consequences of
transferring large amounts of data over the network is to
minimize the number of transfers by packing several graphs
into a larger supergraph (say, instead of sending a single
λ-supergraph parameter, one might send a two-seed
supergraph, i.e. two λ-supergraphs). Thus, the overhead of the
send/receive network operations gets amortized over larger
amounts of data and the transfer performance increases.

One could also attempt to make better use of the
available bandwidth by overlapping communication with
computation. Issuing two concurrent RPCs to the same server
results in an overlap of the execution of the first call with the
transfer of the parameters of the second call. Naturally,
handling two concurrent RPCs requires multi-threaded server
capabilities. Given that the TI-RPC Linux package does not
include multi-threaded support for server-side RPC (unlike
the original Sun Microsystems/Oracle version), we had to
implement a multi-threaded RPC server as well, but this
choice turned out to be beneficial in our case study for
CPMC using NVIDIA's NPPI library (see §4.5).


4 Case Study: CPMC 

The CPMC release can use two parametric maximum
flow algorithms. In our evaluation, we have chosen
the pseudoflow algorithm because it can also run
"approximately" (see §2.1), i.e. without computing all the
breakpoints, by accepting as argument a preset λ-schedule.
Thus, the whole CPMC algorithm runs faster and the
comparison to our framework is fair. The other option
works only by computing all the breakpoints online.

In this setup, CPMC iteratively solves a list of
independent problems defined by a pair of foreground and
background seeds and a λ-schedule passed to the pseudoflow
algorithm. The problem solver is implemented in MATLAB,
while the pseudoflow solver is implemented in C (hooked
up with the MATLAB code by means of MEX libraries).
Thus, a trivially parallel solution can be easily implemented
by using MATLAB's parfor instruction that executes each
iteration independently as a thread on one of the available
processor cores.


Motivated by the argument in §2.2, we compared the
pseudoflow-based solution to that of the supergraph
framework, which parallelizes the figure-ground stage of CPMC,
in order to assess its utility as a tool towards real-time
performance for image segmentation. To that end, we have
employed a cluster of GPUs managed by the framework as
described in §3. The GPU cards have run the push-relabel
implementation of the NVIDIA NPPI library [2, 3]. With no
access to the library source code, we had to use the NVIDIA
code as a black box, with no possibility to perform any kind
of code optimization.

4.1 Evaluation Setup 

The evaluation was carried out on two types of HP workstations:
three Z840 stations, each equipped with one Intel(R)
Xeon(R) CPU E5-2620 v3 @ 2.40GHz processor (6 cores)
and 32 GB RAM, and one Z420 station equipped with an
Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz processor (6
cores) and 32 GB RAM. We used five NVIDIA GPUs for
the experiments: two Tesla K40 boards (one per Z840 station)
and three Titan Black boards (two hosted on a Z840 machine,
and the third on the Z420 station). A Tesla K40 board
features 2880 cores clocked at 745 MHz and 12 GB of RAM,
while a Titan Black board has 2880 cores running at 889
MHz and 6 GB of RAM. We used CUDA 6.5 for the
experiments. All the systems run Linux and are connected by
Gigabit Ethernet.

Unless otherwise stated, all the experiments use a 500-image
subset of the VOC2012 dataset and report, over this
subset, the minimum, maximum, and average times
taken by the graph cuts of the CPMC [10]
figure-ground segmentation stage that yields the segment
hypotheses. This stage uses three different Segmenter methods
for a total of 178 seeds and 20 λ values for each of these
seeds (these are the default values; for details see [13]).

4.2 Performance of the Pseudoflow Algorithm 

The first experiment assesses the performance
of the pseudoflow algorithm on multi-core architectures.
As already mentioned, this is a sequential parametric
maximum flow algorithm that can be easily combined with the
MATLAB parfor instruction to implement a form of trivial
parallelism on multi-core architectures. This parallel solution
serves as our multi-core baseline.


Time                Min        Avg        Max
1 parfor thread     13.35 s    46.59 s    72.59 s
2 parfor threads    8.14 s     28.29 s    42.70 s
4 parfor threads    5.35 s     17.71 s    26.23 s
6 parfor threads    4.10 s     14.42 s    21.48 s

Table 1. Parametric pseudoflow performance on
multi-core architectures.

We ran CPMC with the pseudoflow solver on a 6-core
Z840 workstation with a λ-schedule of 20 values, varying
the number of threads of the parfor instruction from 1
to 6. The results are shown in table 1 and represent the
average, maximum and minimum times for performing the
graph cuts of the CPMC figure-ground segmentation stage.
These numbers, which are also plotted in a figure, show
diminishing returns as cores are added: six threads reduce the
average time only about 3.2 times relative to one thread.
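The scaling behavior can be quantified with the usual speedup and parallel efficiency ratios. The short sketch below applies them to the average times of Table 1; the function names are illustrative and not part of the original evaluation.

```python
# Average times from Table 1, in seconds, keyed by number of parfor threads.
AVG_TIME_S = {1: 46.59, 2: 28.29, 4: 17.71, 6: 14.42}


def speedup(threads: int) -> float:
    """Single-thread average time divided by the time on `threads` threads."""
    return AVG_TIME_S[1] / AVG_TIME_S[threads]


def efficiency(threads: int) -> float:
    """Speedup per thread; 1.0 would mean perfect linear scaling."""
    return speedup(threads) / threads


for n in sorted(AVG_TIME_S):
    print(f"{n} threads: speedup {speedup(n):.2f}, efficiency {efficiency(n):.2f}")
```

The efficiency falls steadily as threads are added, which is the diminishing return visible in the table.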


4.3 Basic Performance of Push Relabel on GPUs 

Our next experiment on the VOC image subset attempts
to shed some light on the local performance of our GPU
cards (that is, without RPC) when using the methods
described earlier. We ran the experiments on the Z840 workstations
using λ-supergraphs (i.e., one-seed supergraphs) with
20 λ values when calling the NVIDIA NPPI nppiGraphcut
routine. First, we show in table 2 the performance of
the GPU boards without using supergraphs at all. This is
the method we called “batch”, which iteratively calls the
nppiGraphcut routine for every λ value in the schedule.
These results are the baseline for our next comparisons.


Time           Min        Avg         Max
Tesla K40      61.42 s    140.88 s    256.54 s
Titan Black    45.07 s    102.28 s    181.38 s

Table 2. Batch performance of NVIDIA's push relabel
implementation.

Time           Min        Avg        Max
Tesla K40      17.70 s    63.53 s    167.16 s
Titan Black    13.61 s    47.47 s    122.69 s

Table 3. Performance of NVIDIA's push relabel
implementation without s-t swaps.

Table 3 shows the performance figures of the two boards
when using supergraphs without applying the s-t swap
optimization (see §3.2). The comparison to table 2 reveals that
the use of supergraphs reduces the average figure-ground
segmentation latency by a factor of more than two for both
types of boards, thus making a strong case for the use of
supergraphs. Note also the minimum latencies, which are
more than three times lower with supergraphs.
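One way to see why supergraphs help is to count solver invocations per image under the default settings (178 seeds, 20 λ values). The sketch below assumes, as a simplification that the text does not state explicitly, that each supergraph is solved with a single graph-cut call and that seeds are packed evenly into supergraphs; the function names are hypothetical.

```python
import math


def batch_calls(num_seeds: int, num_lambdas: int) -> int:
    """Batch mode: one graph-cut call per (seed, lambda) pair."""
    return num_seeds * num_lambdas


def supergraph_calls(num_seeds: int, seeds_per_supergraph: int = 1) -> int:
    """Supergraph mode (assumed): one call per supergraph."""
    return math.ceil(num_seeds / seeds_per_supergraph)


def graphs_per_supergraph(num_lambdas: int, seeds_per_supergraph: int = 1) -> int:
    """Number of image graphs combined into one supergraph."""
    return num_lambdas * seeds_per_supergraph


print(batch_calls(178, 20))          # batch method
print(supergraph_calls(178, 1))      # one-seed (lambda) supergraphs
print(supergraph_calls(178, 2))      # two-seed supergraphs
print(graphs_per_supergraph(20, 2))  # image graphs in a two-seed supergraph
```

Under these assumptions, the same amount of work is submitted in far fewer and far larger calls, which gives the GPU more parallelism per invocation.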


Time               Min        Avg        Max
Tesla K40          10.45 s    32.80 s    54.76 s
Titan Black        8.40 s     25.74 s    42.77 s
2 x Titan Black    4.53 s     13.93 s    22.77 s

Table 4. Performance of NVIDIA's push relabel
implementation using properly s-t swapped supergraphs.

Table 4 presents the results of using supergraphs that are
properly s-t swapped for optimal performance. The third
row of the table presents the results of simultaneously using
both of the Titan Black cards in one of our Z840 stations.
On average, the Titan Black board takes roughly 78% of the
K40 time and about 55% of the time of the trivially parallel,
single-threaded algorithm (see table 1). It is also worth
noting that two Titan Blacks together outperform, on average,
the trivially parallel algorithm on six threads.

The comparison to table 3 shows that the s-t swap
operation on supergraphs almost halves the average figure-ground
segmentation latency for the Tesla K40 and reduces it by
roughly 46% for the Titan Black. Also noteworthy is that the
maximum latency values of supergraphs that do not use s-t swaps
are 3 and 2.86 times larger, respectively, which shows how poor
the degree of available parallelism can be, at times, if the
source and sink are not properly swapped.

4.4 Impact of Seed Supergraphs 

Using seed supergraphs (see §3.1) should achieve better
usage of the underlying hardware. We assessed the performance
of NVIDIA's push relabel implementation on seed
supergraphs when varying the number of seeds. Table 5
shows the results on a Z840 station for two seeds (four-seed
supergraphs have shown only marginally better figures).
The comparison to the one-seed supergraph figures (table 4)
shows improvements of roughly 10%, on average.


Time               Min       Avg        Max
Tesla K40          8.82 s    29.78 s    54.90 s
Titan Black        7.04 s    23.49 s    40.63 s
2 x Titan Black    4.03 s    12.79 s    22.45 s

Table 5. Impact of seed supergraphs (2 seeds).


4.5 RPC Performance 

We also conducted experiments to evaluate the performance
of our RPC version of the NVIDIA nppiGraphcut
call used to access remote GPU boards. The experiments
used a Tesla K40 board, Z840 workstations and λ-schedules
of 20 values. The first row of table 6 presents the graph
cut times to get the figure-ground segment hypotheses when
using λ-supergraphs only. We also evaluated the optimizations
discussed earlier by running larger supergraphs (two-seed
supergraphs) to make better use of the available
network. One can see that the difference between one- and
two-seed performance in table 6 is larger than it is for local
Tesla K40 calls (see tables 4 and 5). This difference can be
attributed to better network usage in the case of larger
supergraphs. The last two rows of table 6 show that multi-threaded
RPC servers improve the performance when two consecutive
GPU RPCs are submitted to the same server in order to
overlap communication with computation.


Time                   Min        Avg        Max
1 seed                 17.28 s    48.14 s    74.99 s
2 seeds                13.91 s    42.71 s    72.29 s
MT server (1 seed)     14.59 s    39.14 s    62.51 s
MT server (2 seeds)    11.14 s    33.87 s    59.82 s

Table 6. Performance of remote GPU RPCs.
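The benefit of overlapping communication with computation on a multi-threaded server can be illustrated with a simple two-stage pipeline model. This is a sketch under stated assumptions, not a model of the actual RPC server: each request is reduced to a transfer time and a compute time, the transfer of the next request may proceed while the GPU computes the current one, and the return of results is ignored. The timings in the example are made up.

```python
from typing import Sequence, Tuple

Request = Tuple[float, float]  # (transfer_time, compute_time), in seconds


def serial_time(requests: Sequence[Request]) -> float:
    """Single-threaded server: each request is received, then computed."""
    return sum(transfer + compute for transfer, compute in requests)


def overlapped_time(requests: Sequence[Request]) -> float:
    """Two-stage pipeline: receiving request i+1 overlaps computing request i."""
    recv_done = 0.0   # time at which the latest transfer has finished
    comp_done = 0.0   # time at which the latest computation has finished
    for transfer, compute in requests:
        recv_done += transfer
        # Computation starts once the data is in and the GPU is free.
        comp_done = max(recv_done, comp_done) + compute
    return comp_done


requests = [(1.0, 3.0), (1.0, 3.0)]   # illustrative numbers only
print(serial_time(requests), overlapped_time(requests))
```

In this model the saving is bounded by the total transfer time of all requests but the first, so larger gains require either more requests in flight or a larger share of communication in each request.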


4.6 Parametric Max Flow using Supergraphs and 
Clusters of GPUs 

Once the performance of the individual boards and
mechanisms had been assessed, we proceeded to set up a
cluster of GPUs acting as a parallel, cluster-wide
parametric maximum flow solver based on supergraphs and
the insights gained from the previous experiments. We
used the Z840 workstation equipped with two
Titan Black cards as the master node, since these cards are
faster than the K40s and local access to them should be
faster than access by means of RPC. The other three stations,
two Z840 machines equipped with one Tesla K40 card each and
the Z420 machine hosting the third Titan Black board, were
used to build the cluster of GPU servers. We tested with
λ-schedules of 20 values and two-seed supergraphs, as
such supergraphs have shown the best performance. We
varied the number of boards in the cluster from three to five
and compared the results with those of CPMC using the
pseudoflow solver on a 6-core Z840 machine. The figures
are shown in table 7 and reveal better overall times for the
graph cuts of the figure-ground segmentation stage on
clusters of GPUs. In terms of average times, the GPU cluster
solutions take roughly 72%, 62% and 55%, respectively,
of the time needed to run the trivially parallel solution using
the pseudoflow algorithm on a 6-core processor.


Time                     Min       Avg        Max
Pseudoflow 6 threads     4.10 s    14.35 s    21.48 s
2 Titan Black + 1 K40    3.49 s    10.34 s    17.86 s
2 Titan Black + 2 K40    2.91 s    8.88 s     15.04 s
3 Titan Black + 2 K40    2.67 s    7.85 s     12.16 s

Table 7. The performance of GPU clusters vs. the
multi-core based solution.

4.7 Graph Cut Accuracy 

So far, our evaluation has concerned running times, but
CPMC and image segmentation algorithms in general also
need to fulfill their primary goal of accuracy. The accuracy
of an image segmentation algorithm is influenced by several
factors, the accuracy of the graph cut computation being an
important one. We therefore also evaluated the previously
tested algorithms in terms of accuracy.

An image segmentation accuracy measure is a similarity
measure, defined according to the VOC challenge rules
as the degree of overlap between the set of segments
(pixel masks) S produced by the image segmentation algorithm
and the ground truth G (correct image segmentation
masks delineated by hand, provided for reference). Alternative
measures include the F-measure. The overlap measure
is computed as follows:

$$\mathrm{Overlap}(S, G) = \frac{|S \cap G|}{|S \cup G|}$$
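Reading the overlap measure as intersection over union (the name used in the caption of Table 8), it can be sketched for two pixel masks as follows. Representing a mask as a set of (row, column) pixel coordinates and returning 0 for two empty masks are choices made for this example only.

```python
from typing import Iterable, Tuple

Pixel = Tuple[int, int]  # (row, column)


def overlap(segment: Iterable[Pixel], ground_truth: Iterable[Pixel]) -> float:
    """Intersection over union of two pixel masks given as coordinate sets."""
    s, g = set(segment), set(ground_truth)
    union = s | g
    if not union:
        return 0.0  # convention chosen here for two empty masks
    return len(s & g) / len(union)


# Two 2x2 squares that share one column of two pixels.
segment = {(0, 0), (0, 1), (1, 0), (1, 1)}
truth = {(0, 1), (0, 2), (1, 1), (1, 2)}
print(overlap(segment, truth))  # 2 shared pixels out of 6 in the union
```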

The results are presented in table 8. Note that the overlap
measure quantifies the accuracy of the whole method (so
far, we have presented running times for graph cuts in the
figure-ground segmentation stage of CPMC).

The difference between the accuracy of CPMC running
the push-relabel algorithm on supergraphs mapped on a
cluster of GPU boards and that of the pseudoflow algorithm
is around 1% and can probably be attributed to the fact that
a pseudoflow algorithm is slightly less precise than a
push-relabel algorithm. The point of this experiment is to show,
however, that the NVIDIA implementation can provide
reliable, high quality segments for CPMC. The comparison to
the pseudoflow figure confirms that.

                       Pseudoflow    Push-relabel on GPU cluster
Avg. overlap (20 λ)    0.734         0.743

Table 8. Image segmentation performance of CPMC
in terms of intersection over union overlap.

4.8 Segmentation Accuracy vs. Speed Trade-Off 


The CPMC release sets the default λ-schedule size
to 20 values, as this choice has proven to yield the best
results in the VOC challenges. However, one can always
trade off segmentation accuracy for improved running
times. In this section, we present the results of halving the
size of the λ-schedule (see table 9). We repeated the
experiments in §4.6 using two-seed supergraphs and 10-value
λ-schedules. The comparison to tables 7 and 5 (for 2 x Titan
Black) shows a reduction of roughly 40% in the average
running times (slightly more for the setup with two local
Titan Black boards). Using four-seed supergraphs improves
the running times only marginally (e.g., for the 5-GPU
cluster the average execution time amounts to 4.54 seconds). It
is also worth noting that the trivially parallel pseudoflow
solution using 10-value λ-schedules and six parfor threads does
not yield this kind of improvement over the 20-value case
(a decrease of only roughly 20%).


Time                     Min       Avg        Max
Pseudoflow 6 threads     3.79 s    11.36 s    16.21 s
2 Titan Black            2.46 s    7.32 s     12.84 s
2 Titan Black + 1 K40    2.20 s    6.05 s     12.03 s
2 Titan Black + 2 K40    1.61 s    5.26 s     8.34 s
3 Titan Black + 2 K40    1.68 s    4.78 s     7.63 s

Table 9. The performance of GPU clusters vs. the
multi-core based solution for a 10 value λ-schedule.


                       Pseudoflow    Push-relabel on GPU cluster
Avg. overlap (10 λ)    0.719         0.732

Table 10. Image segmentation accuracy for 10 value
λ-schedules.

The gain becomes even more important if one considers
table 10, which reports the segmentation accuracy (all of the
GPU-based solutions yield at least 0.732 overlap, so we
report just one figure). Note that, even if the accuracy of
the segmentation drops by roughly 1% compared to that
reported in table 8, it practically equals that of the pseudoflow
algorithm using twice as many λ values. Thus, one can use
fewer λ values and get significantly improved running times
at almost no accuracy loss.
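The trade-off can be made explicit by putting the time saved next to the overlap lost when the schedule is halved. The sketch below uses the average times of Tables 7 and 9 and the overlaps of Tables 8 and 10; the helper names are illustrative.

```python
# Average times in seconds: (20-value schedule, 10-value schedule),
# taken from Tables 7 and 9.
AVG_TIME_S = {
    "Pseudoflow 6 threads": (14.35, 11.36),
    "2 Titan Black + 1 K40": (10.34, 6.05),
    "2 Titan Black + 2 K40": (8.88, 5.26),
    "3 Titan Black + 2 K40": (7.85, 4.78),
}

# Average overlap: (20-value schedule, 10-value schedule),
# taken from Tables 8 and 10.
AVG_OVERLAP = {
    "Pseudoflow": (0.734, 0.719),
    "Push-relabel on GPU cluster": (0.743, 0.732),
}


def time_saved(setup: str) -> float:
    """Fraction of the average running time saved by halving the schedule."""
    full, half = AVG_TIME_S[setup]
    return 1.0 - half / full


def overlap_lost(method: str) -> float:
    """Absolute drop in average overlap caused by halving the schedule."""
    full, half = AVG_OVERLAP[method]
    return full - half


for setup in AVG_TIME_S:
    print(f"{setup}: {100 * time_saved(setup):.1f}% less time")
for method in AVG_OVERLAP:
    print(f"{method}: overlap drops by {overlap_lost(method):.3f}")
```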

4.9 Scheduling 

Static vs. Dynamic Scheduling. As pointed out in
§3.3.1, we schedule parametric max flow computations on
supergraphs using a FIFO ordered, dynamically managed
list of computing servers. In contrast, a static solution
would assign each iteration of a MATLAB parfor loop
computing parametric max flows to a given server in the list. We
set up four experiments using parfor loops with up
to five threads and all of the GPU board combinations from
§4.6. Each parfor iteration was statically scheduled mod n,
where n was the number of threads/boards used, to run
supergraph max flows on the GPU boards. We used two-seed
supergraphs and λ-schedules of 20 values. Table 11 shows
the results. The comparison to the dynamic solution (table 7,
and the earlier local results for 2 x Titan Black) reveals that
dynamic scheduling improves average times over static
solutions by roughly 10%, 32%, 34% and 30%, respectively.


Time                         Min       Avg        Max
2 x Titan B. parfor          4.58 s    15.48 s    27.68 s
2 Titan B. + 1 K40 parfor    4.70 s    15.25 s    30.11 s
2 Titan B. + 2 K40 parfor    4.14 s    13.42 s    27.98 s
3 Titan B. + 2 K40 parfor    3.57 s    11.24 s    18.90 s

Table 11. Static scheduling.
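The difference between the two policies can be illustrated with a small makespan simulation. It is a sketch under simplifying assumptions rather than the framework's scheduler: each job is reduced to an amount of work, each server to a relative speed, and the dynamic policy hands the next job in FIFO order to whichever server becomes free first. The job sizes in the example are made up.

```python
import heapq
from typing import Sequence


def static_makespan(work: Sequence[float], speeds: Sequence[float]) -> float:
    """Job i always runs on server i mod n, regardless of server load."""
    busy = [0.0] * len(speeds)
    for i, w in enumerate(work):
        k = i % len(speeds)
        busy[k] += w / speeds[k]
    return max(busy)


def dynamic_makespan(work: Sequence[float], speeds: Sequence[float]) -> float:
    """Jobs are taken in FIFO order by the server that becomes free first."""
    free_at = [(0.0, k) for k in range(len(speeds))]  # (time, server index)
    heapq.heapify(free_at)
    for w in work:
        t, k = heapq.heappop(free_at)
        heapq.heappush(free_at, (t + w / speeds[k], k))
    return max(t for t, _ in free_at)


# Alternating long and short jobs on two equally fast servers: the static
# mod-n rule sends every long job to the same server.
work = [4.0, 1.0, 4.0, 1.0, 4.0, 1.0]
speeds = [1.0, 1.0]
print(static_makespan(work, speeds), dynamic_makespan(work, speeds))
```

The example is deliberately unfavorable to the static rule; on other inputs the two policies can perform equally well.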

Scheduling Efficiency. Since finding the optimal schedule
is proven to be NP-hard (see §3.3.1), we aimed to find
how much worse our dynamic scheduler performs compared
to a theoretically superior algorithm such as LPT (see
§3.3.1). To match the simplest paradigm of parallel,
non-preemptive scheduling for makespan minimization, namely
that of Parallel and Identical Machines, we gathered the
running times of our one-seed supergraph cuts for the Z840
machine equipped with two identical Titan Black boards,
reported in the last row of table 4. We sorted the values in
non-increasing order and applied the LPT algorithm offline. The
results show that LPT makespans are 1.7% smaller,
on average, than those of our dynamic scheduler (the minimum
difference among the 500 images being 0.01% and the
maximum 8.8%). These results show the efficiency and utility
of our scheduler, given that LPT relies on a priori knowledge
of the running times and thus cannot be run online in our case.
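The offline comparison can be sketched as follows for identical machines. LPT (longest processing time first) is list scheduling applied to the jobs sorted in non-increasing order of duration, while the FIFO variant takes them in arrival order. The durations in the example are made up and are not the measured running times.

```python
import heapq
from typing import Iterable


def list_schedule_makespan(durations: Iterable[float], machines: int) -> float:
    """Assign each job, in the given order, to the least loaded machine."""
    loads = [0.0] * machines
    heapq.heapify(loads)
    for d in durations:
        heapq.heappush(loads, heapq.heappop(loads) + d)
    return max(loads)


def lpt_makespan(durations: Iterable[float], machines: int) -> float:
    """LPT: list scheduling on the jobs sorted by non-increasing duration."""
    return list_schedule_makespan(sorted(durations, reverse=True), machines)


durations = [1.0, 1.0, 2.0, 2.0, 4.0]   # arrival order, illustrative only
fifo = list_schedule_makespan(durations, machines=2)
lpt = lpt_makespan(durations, machines=2)
print(fifo, lpt, f"{100 * (fifo - lpt) / fifo:.1f}% smaller with LPT")
```

Sorting requires all durations up front, which is exactly the information an online scheduler does not have.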

5 Conclusions 

In this paper we have presented a solution to approximate
parallel parametric maximum flow behavior for image
segmentation problems based on supergraphs. We have
also introduced a general, parallel framework that can run
parametric maximum flow problems on various platforms
(multi-core, GPU), either locally or distributed in a cluster,
as instructed by a provably efficient dynamic scheduler.
The framework is also useful as an evaluation tool
for the available parallel maximum flow implementations.
We report the results of using NVIDIA's GPU
implementation of the push relabel maximum flow algorithm
together with CPMC [10], a state-of-the-art image segmentation
algorithm, as a case study that points out the utility
of our framework. The evaluation has shown that our
solution achieves near real-time performance, practically
without any segmentation accuracy loss.